Thanks for the info. I wasn't aware that there was a free TI compiler, so it certainly makes sense to look at that first! It doesn't worry me too much that the TI compiler will be a lot slower than TCC, since for our machine the critical "supervisor" code will be flashed in.
Also the thread on programming was interesting - I actually used some of those techniques to "link" to functions in the supervisor thread, except I use a big chunk at the end of the gather buffer instead of persist vars.
I wasn't really thinking of an optimizer, but just allowing "manual" optimization using the C "register" storage class. That would not have been an excessive amount of work, and I think the improvement could be quite dramatic: register ops are 5ns vs. 30ns for memory accesses, a factor of six.
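As a rough sketch of what that manual optimization looks like from the C side (the function and variable names here are made up for illustration, not taken from the actual supervisor code), the hot loop variables are simply declared "register". How much this helps depends entirely on the compiler honoring the hint, which is the assumption here.

```c
#include <stdint.h>

/* Hypothetical example: sum of absolute position errors over n axes.
 * "register" asks the compiler to keep the accumulator, the loop
 * counter and the per-axis error out of memory. It is only a hint. */
int32_t sum_abs_error(const int32_t *target, const int32_t *actual, int n)
{
    register int32_t acc = 0;   /* running total, touched every iteration */
    register int i;             /* loop counter */

    for (i = 0; i < n; i++) {
        register int32_t e = target[i] - actual[i];
        if (e < 0)
            e = -e;             /* absolute value without a library call */
        acc += e;
    }
    return acc;
}
```

One thing to keep in mind is that C does not allow taking the address of a "register" variable, so anything passed by pointer has to stay an ordinary local.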
The code already optimizes FPGA access, but it was the load and store of axis positions which was a bit of a killer. The first code attempt was overly general; I subsequently cut it down to just what was needed, but I could do more. For example, the index pulses only need to be found at init time, so the code could obviously include a flag which enables/disables those particular inputs. So yes, you are right that all 9 inputs don't need constant monitoring; it was just simpler to monitor them all the time. Well, as Einstein was reputed to have said, things should be as simple as possible, but no simpler.
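To make the index-pulse flag idea concrete, here is a minimal sketch. The bit layout (index pulses on bits 0-2 of a 9-bit input word) and all names are assumptions for illustration only; the real input assignment may well differ.

```c
#include <stdint.h>

#define ALL_INPUTS_MASK  0x1FFu  /* 9 monitored inputs, bits 0..8 */
#define INDEX_PULSE_MASK 0x007u  /* ASSUMED: index pulses on bits 0..2 */

/* Which inputs to watch: everything while finding index pulses at init,
 * everything except the index pulses afterwards. */
static uint16_t monitored_mask(int index_search_enabled)
{
    if (index_search_enabled)
        return (uint16_t)ALL_INPUTS_MASK;
    return (uint16_t)(ALL_INPUTS_MASK & ~INDEX_PULSE_MASK);
}

/* Returns a bitmask of monitored inputs that changed since the last poll. */
uint16_t changed_inputs(uint16_t raw, uint16_t prev, int index_search_enabled)
{
    return (uint16_t)((raw ^ prev) & monitored_mask(index_search_enabled));
}
```

With the flag cleared after init, changes on the index-pulse bits are masked out, so the loop only has to react to the remaining six inputs.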
The loop time of 1ms is because I am doing it in the main supervisor loop, which is continually doing a bunch of other stuff, such as handling the I2C bus master emulation, responding to interlocks and fault conditions, baby-sitting the spindle and so on.